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Abstract 

Many  learning  tasks  have  subtasks  for  which 
much  training  data  exists.  Therefore,  we  want 
to  transfer  learning  from  the  old,  general- 
purpose  subtask  to  a  more  specific  new  task, 
for  which  there  is  often  less  data.  While  work 
in  transfer  learning  often  considers  how  the 
old  task  should  affect  learning  on  the  new 
task,  in  this  paper  we  show  that  it  helps  to 
take  into  account  how  the  new  task  affects  the 
old.  Specifically,  we  perform  joint  decoding  of 
separately-trained  sequence  models,  preserv¬ 
ing  uncertainty  between  the  tasks  and  allowing 
information  from  the  new  task  to  affect  predic¬ 
tions  on  the  old  task.  On  two  standard  text  data 
sets,  we  show  that  joint  decoding  outperforms 
cascaded  decoding. 

1  Introduction 

Many  tasks  in  natural  language  processing  are  solved  by 
chaining  errorful  sub  tasks.  Within  information  extrac¬ 
tion,  for  example,  part-of-speech  tagging  and  shallow 
parsing  are  often  performed  before  the  main  extraction 
task.  Commonly  these  subtasks  have  their  own  standard 
sets  of  labeled  training  data:  for  example,  many  large 
data  sets  exist  for  learning  to  extract  person  names  from 
newswire  text;  whereas  the  available  training  data  for  new 
applications,  such  as  extracting  appointment  information 
from  email,  tends  to  be  much  smaller.  Thus,  we  need  to 
transfer  regularities  learned  from  a  well-studied  subtask, 
such  as  finding  person  names  in  newswire  text,  to  a  new, 
related  task,  such  as  finding  names  of  speakers  in  email 
seminar  announcements. 

In  previous  NLP  systems,  transfer  is  often  accom¬ 
plished  by  training  a  model  for  the  subtask,  and  using  its 
prediction  as  a  feature  for  the  new  task.  For  example,  re¬ 
cent  CoNLL  shared  tasks  (Tjong  Kim  Sang  &  De  Meul- 
der,  2003;  Carreras  &  Marquez,  2004),  which  are  stan¬ 
dard  data  sets  for  such  common  NLP  tasks  as  clause  iden¬ 


tification  and  named-entity  recognition,  include  predic¬ 
tions  from  a  part-of-phrase  tagger  and  a  shallow  parser  as 
features.  But  including  only  the  single  most  likely  sub¬ 
task  prediction  fails  to  exploit  useful  dependencies  be¬ 
tween  the  tasks.  First,  if  the  subtask  prediction  is  wrong, 
the  model  for  the  new  task  may  not  be  able  to  recover.  Of¬ 
ten,  errors  propagate  upward  through  the  chain  of  tasks, 
causing  errors  in  the  final  output.  This  problem  can  be 
ameliorated  by  preserving  uncertainty  in  the  subtask  pre¬ 
dictions,  because  even  if  the  best  subtask  prediction  is 
wrong,  the  distribution  over  predictions  can  still  be  some¬ 
what  accurate. 

Second,  information  from  the  main  task  can  inform  the 
subtask.  This  is  especially  important  for  learning  trans¬ 
fer,  because  the  new  domain  often  has  different  charac¬ 
teristics  than  the  old  domain,  which  is  often  a  standard 
benchmark  data  set.  For  example,  named-entity  recog¬ 
nizers  are  usually  trained  on  newswire  text,  which  is  more 
structured  and  grammatical  than  email,  so  we  expect  an 
off-the-shelf  named-entity  recognizer  to  perform  some¬ 
what  worse  on  email.  An  email  task,  however,  often  has 
domain-specific  features,  such  as  Previous  Word  is 
Speaker:,  which  were  unavailable  or  uninformative  to  the 
subtask  on  the  old  training  set,  but  are  very  informative  to 
the  subtask  in  the  new  domain.  While  previous  work  in 
transfer  learning  has  considered  how  the  old  task  can  help 
the  new  task,  in  this  paper  we  show  how  the  new  task  can 
help  itself  by  improving  predictions  on  the  old. 

In  this  paper  we  address  the  issue  of  transfer  by  train¬ 
ing  a  cascade  of  models  independently  on  the  various 
training  sets,  but  at  test  time  combining  them  into  a  single 
model  in  which  decoding  is  performed  jointly.  For  the  in¬ 
dividual  models,  we  use  linear-chain  conditional  random 
fields  (CRFs),  because  the  great  freedom  that  they  allow 
in  feature  engineering  facilitates  the  learning  of  richer  in¬ 
teractions  between  the  subtasks.  We  train  a  linear  chain 
CRF  on  each  subtask,  using  the  prediction  of  the  previous 
subtask  as  a  feature.  At  test  time,  we  combine  the  learned 
weights  from  the  original  CRFs  into  a  single  grid-shaped 
factorial  CRF,  which  makes  predictions  for  all  the  tasks 
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at  once.  Viterbi  decoding  in  this  combined  model  im¬ 
plicitly  considers  all  possible  predictions  for  the  subtask 
when  making  decisions  in  the  main  task. 

We  evaluate  joint  decoding  for  learning  transfer  on  a 
standard  email  data  set  and  a  standard  entity  recognition 
task.  On  the  email  data  set,  we  show  a  significant  gain 
in  performance,  including  new  state-of-the-art  results.  Of 
particular  interest  for  transfer  learning,  we  also  show  that 
using  joint  decoding,  we  achieve  equivalent  results  to  cas¬ 
caded  decoding  with  25%  less  training  data. 


2  Linear-chain  CRFs 


Conditional  random  fields  (CRFs)  (Lafferty  et  ah,  2001) 
are  undirected  graphical  models  that  are  conditionally 
trained.  In  this  section,  we  describe  CRFs  for  the  linear- 
chain  case.  Linear-chain  CRFs  can  be  roughly  under¬ 
stood  as  conditionally-trained  finite  state  machines.  A 
linear-chain  CRF  defines  a  distribution  over  state  se¬ 
quences  s  =  {si,  .s*2 _ ...,  st}  given  an  input  sequence 

x  =  {xi,X2,  •  ■  • ,  xt}  by  making  a  first-order  Markov 
assumption  on  states.  These  Markov  assumptions  imply 
that  the  distribution  over  sequences  factorizes  in  terms  of 
pairwise  functions  $t(st_i,  st.  x)  as: 


p(»w = 

Z(x) 


(1) 


The  partition  function  Z(x)  is  defined  to  ensure  that  the 
distribution  is  normalized: 


Z(x)  = 

s'  t 

The  potential  functions  <I>t  (st-i ,  St ,  x)  can  be  interpreted 
as  the  cost  of  making  a  transition  from  state  st_i  to  state 
St  at  time  t,  similar  to  a  transition  probability  in  an  HMM. 

Computing  the  partition  function  Z(x)  requires  sum¬ 
ming  over  all  of  the  exponentially  many  possible  state 
sequences  s'.  By  exploiting  Markov  assumptions,  how¬ 
ever,  Z(x)  (as  well  as  the  node  marginals  p(st  |x)  and  the 
Viterbi  labeling)  can  be  calculated  efficiently  by  variants 
of  the  standard  dynamic  programming  algorithms  used 
for  HMMs. 

We  assume  the  potentials  factorize  according  to  a  set 
of  features  {_/>},  which  are  given  and  fixed,  so  that 


The  model  parameters  are  a  set  of  real  weights  A  =  { A;,. }, 
one  for  each  feature. 

Feature  functions  can  be  arbitrary.  For  example,  one 
feature  function  could  be  a  binary  test  fk(st-i,  St,  x,  t) 
that  has  value  1  if  and  only  if  St-i  has  the  label  Speak- 
erName,  st  has  the  label  Other,  and  the  word  xt  be¬ 
gins  with  a  capital  letter.  The  chief  practical  advantage 


of  conditional  models,  in  fact,  is  that  we  can  include  ar¬ 
bitrary  highly-dependent  features  without  needing  to  es¬ 
timate  their  distribution,  as  would  be  required  to  learn  a 
generative  model. 

Given  fully-labeled  training  instances  {(sj, 'x.j)}jLi, 
CRF  training  is  usually  performed  by  maximizing  the  pe¬ 
nalized  log  likelihood 

f(A)  =  ^  ^  '  ^kfk{sj,t—  1)  sj,ti  x>  t) 

j  t  k 

-Ei°g^(xi)-E^  (4) 

3  k 

where  the  final  term  is  a  zero-mean  Gaussian  prior  placed 
on  parameters  to  avoid  overfitting.  Although  this  maxi¬ 
mization  cannot  be  done  in  closed  form,  it  can  be  op¬ 
timized  numerically.  Particularly  effective  are  gradient- 
based  methods  that  use  approximate  second-order  infor¬ 
mation,  such  as  conjugate  gradient  and  limited-memory 
BFGS  (Byrd  et  ah,  1994).  For  more  information  on 
current  training  methods  for  CRFs,  see  Sha  and  Pereira 
(2003). 


3  Dynamic  CRFs 


Dynamic  conditional  random  fields  (Sutton  et  ah,  2004) 
extend  linear-chain  CRFs  in  the  same  way  that  dynamic 
Bayes  nets  (Dean  &  Kanazawa,  1989)  extend  HMMs. 
Rather  than  having  a  single  monolithic  state  variable, 
DCRFs  factorize  the  state  at  each  time  step  by  an  undi¬ 
rected  model. 

Formally,  DCRFs  are  the  class  of  conditionally-trained 
undirected  models  that  repeat  structure  and  parameters 
over  a  sequence.  If  we  denote  by  <l>c(yCi(,  xt)  the  repe¬ 
tition  of  clique  c  at  time  step  f,  then  a  DCRF  defines  the 
probability  of  a  label  sequence  s  given  the  input  x  as: 


P(s|x) 


lit  $c(yc,t,xt) 
Z{x) 


(5) 


where  as  before,  the  clique  templates  are  parameterized 
in  terms  of  input  features  as 


$c(yc,i,xt)  =  exp 


’'y  ^  ^ kfkiy c,t 

k 


(6) 


Exact  inference  in  DCRFs  can  be  performed  by 
forward-backward  in  the  cross  product  state  space,  if  the 
cross-product  space  is  not  so  large  as  to  be  infeasible. 
Otherwise,  approximate  methods  must  be  used;  in  our 
experience,  loopy  belief  propagation  is  often  effective 
in  grid-shaped  DCRFs.  Even  if  inference  is  performed 
monolithically,  however,  a  factorized  state  representation 
is  still  useful  because  it  requires  much  fewer  parame¬ 
ters  than  a  fully-parameterized  linear  chain  in  the  cross- 
product  state  space. 


Sutton  et  al.  (2004)  introduced  the  factorial  CRF 
(FCRF),  in  which  the  factorized  state  structure  is  a  grid 
(Figure  1).  FCRFs  were  originally  applied  to  jointly 
performing  interdependent  language  processing  tasks,  in 
particular  part-of-speech  tagging  and  noun-phrase  chunk¬ 
ing.  The  previous  work  on  FCRFs  used  joint  training, 
which  requires  a  single  training  set  that  is  jointly  labeled 
for  all  tasks  in  the  cascade.  For  many  tasks  such  data 
is  not  readily  available,  for  example,  labeling  syntac¬ 
tic  parse  trees  for  every  new  Web  extraction  task  would 
be  prohibitively  expensive.  In  this  paper,  we  train  the 
subtasks  separately,  which  allows  us  the  freedom  to  use 
large,  standard  data  sets  for  well-studied  subtasks  such  as 
named-entity  recognition. 


Figure  1:  Graphical  model  for  the  jointly-decoded  CRF. 
All  of  the  pairwise  cliques  also  have  links  to  the  observed 
input,  although  we  omit  these  edges  in  the  diagram  for 
clarity. 


4  Alternatives  for  Learning  Transfer 

In  this  section,  we  enumerate  several  classes  of  methods 
for  learning  transfer,  based  on  the  amount  and  type  of 
interaction  they  allow  between  the  tasks.  The  principal 
differences  between  methods  are  whether  the  individual 
tasks  are  performed  separately  in  a  cascade  or  jointly; 
whether  a  single  prediction  from  the  lower  task  is  used, 
or  several;  and  what  kind  of  confidence  information  is 
shared  between  the  sub  tasks. 

The  main  types  of  transfer  learning  methods  are: 

1.  Cascaded  training  and  testing.  This  is  the  traditional 
approach  in  NLR  in  which  the  single  best  prediction 
from  the  old  task  is  used  in  the  new  task  at  training 
and  test  time.  In  this  paper,  we  show  that  allowing 
richer  interactions  between  the  subtasks  can  benefit 
performance. 

2.  Joint  training  and  testing.  In  this  family  of  ap¬ 
proaches,  a  single  model  is  trained  to  perform  all  the 
subtasks  at  once.  For  example,  in  Caruana’s  work 
on  multitask  learning  (Caruana,  1997),  a  neural  net¬ 
work  is  trained  to  jointly  perform  multiple  classifica¬ 
tion  tasks,  with  hidden  nodes  that  form  a  shared  rep¬ 
resentation  among  the  tasks.  Jointly  trained  meth¬ 
ods  allow  potentially  the  richest  interaction  between 
tasks,  but  can  be  expensive  in  both  computation  time 
required  for  training  and  in  human  effort  required  to 
label  the  joint  training  data. 

Exact  inference  in  a  jointly-trained  model,  such 
as  forward-backward  in  an  FCRF,  implicitly  con¬ 
siders  all  possible  subtask  predictions  with  confi¬ 
dence  given  by  the  model’s  probability  of  the  pre¬ 
diction.  However,  for  computational  efficiency,  we 
can  use  inference  methods  such  as  particle  filtering 
and  sparse  message -passing  (Pal  et  al.,  2005),  which 
communicate  only  a  limited  number  of  predictions 
between  sections  of  the  model. 


3.  Joint  testing  with  cascaded  training.  Although  a 
joint  model  over  all  the  subtasks  can  have  better  per¬ 
formance,  it  is  often  much  more  expensive  to  train. 
One  approach  for  reducing  training  time  is  cascaded 
training,  which  provides  both  computational  effi¬ 
ciency  and  the  ability  to  reuse  large,  standard  train¬ 
ing  sets  for  the  subtasks.  At  test  time,  though,  the 
separately-trained  models  are  combined  into  a  sin¬ 
gle  model,  so  that  joint  decoding  can  propagate  in¬ 
formation  between  the  tasks. 

Even  with  cascaded  training,  it  is  possible  to  pre¬ 
serve  some  uncertainty  in  the  subtask’s  predictions. 
Instead  of  using  only  a  single  subtask  prediction 
for  training  the  main  task,  the  subtask  can  pass  up¬ 
wards  a  lattice  of  likely  predictions,  each  of  which 
is  weighted  by  the  model’s  confidence.  This  has  the 
advantage  of  making  the  training  procedure  more 
similar  to  the  joint  testing  procedure,  in  which  all 
possible  subtask  predictions  are  considered. 

In  the  next  two  sections,  we  describe  and  evaluate 
joint  testing  with  cascaded  training  for  transfer  learning 
in  linear-chain  CRFs.  At  training  time,  only  the  best 
subtask  prediction  is  used,  without  any  confidence  infor¬ 
mation.  Even  though  this  is  perhaps  the  simplest  joint- 
testing/cascaded-training  method,  we  show  that  it  still 
leads  to  a  significant  gain  in  accuracy. 

5  Composition  of  CRFs 

In  this  section  we  briefly  describe  how  we  combine 
individually-trained  linear-chain  CRFs  using  composi¬ 
tion.  For  a  series  of  N  cascaded  tasks,  we  train  indi¬ 
vidual  CRFs  separately  on  each  task,  using  the  prediction 
of  the  previous  CRF  as  a  feature.  We  index  the  CRFs 
by  i,  so  that  the  state  of  CRF  i  at  time  t  is  denoted  s\. 
Thus,  the  feature  functions  for  CRF  i  are  of  the  form 
fk(st- 1!  st>  st x’ t) — that  is,  they  depend  not  only  on 
the  observed  input  x  and  the  transition  — >  s\)  but 


Wt  =  w 

Wt  matches  [A-Z]  [a-z]  + 

Wt  matches  [A-Z]  [A-Z]  + 

Wt  matches  [A-Z] 

Wt  matches  [A-Z]  + 

Wt  matches  [A-Z ]  +  [a-z ]  +  [ A-Z ]  +  [a-z ] 

Wt  appears  in  list  of  first  names, 
last  names,  honorifics,  etc. 

Wt  appears  to  be  part  of  a  time  followed  by  a  dash 
Wt  appears  to  be  part  of  a  time  preceded  by  a  dash 
Wt  appears  to  be  part  of  a  date 

Tt  =  T _ 

qk  (x,  t  +  5)  for  all  k  and  S  £  [—4, 4] 


Table  1:  Input  features  t/fc(x,  t)  for  the  seminars  data.  In 
the  above  wt  is  the  word  at  position  t,  Tt  is  the  POS  tag 
at  position  t,  w  ranges  over  all  words  in  the  training  data, 
and  T  ranges  over  all  Penn  Treebank  part-of-speech  tags. 
The  “appears  to  be”  features  are  based  on  hand-designed 
regular  expressions  that  can  span  several  tokens. 

also  on  the  state  s\  1  of  the  previous  transducer. 

We  also  add  all  conjunctions  of  the  input  features  and 
the  previous  transducer’s  state,  for  example,  a  feature  that 
is  1  if  the  current  state  is  SpeakerName,  the  previ¬ 
ous  transducer  predicted  PersonName,  and  the  previ¬ 
ous  word  is  Host : . 

To  perform  joint  decoding  at  test  time,  we  form  the 
composition  of  the  individual  CRFs,  viewed  as  finite- 
state  transducers.  That  is,  we  define  a  new  linear-chain 
CRF  whose  state  space  is  the  cross  product  of  the  states 
of  the  individual  CRFs,  and  whose  transition  costs  are  the 
sum  of  the  transition  costs  of  the  individual  CRFs. 

Formally,  let  S'1,  S2, . . .  SN  be  the  state  sets  and 
A1,  A2, . . .  AN  the  weights  of  the  individual  CRFs.  Then 
the  state  set  of  the  combined  CRF  is  S  =  S1  x  S'2  x  . . .  x 
SN .  We  will  denote  weight  k  in  an  individual  CRF  i  by 
X‘k  and  a  single  feature  by  i,  s\,  s1^1,  x,  t).  Then 

for  s  £  S,  the  combined  model  is  given  by: 

IL exP {e*=i  J2k KfUNi  sr1> X- 1)\ 

P(S|x)  =  - - - Ay  \ - ~ 

Z(x) 

(7) 

The  graphical  model  for  the  combined  model  is  the  fac¬ 
torial  CRF  in  Figure  1 . 

6  Experiments 

6.1  Email  Seminar  Announcements 

We  evaluate  joint  decoding  on  a  collection  of  485  e-mail 
messages  announcing  seminars  at  Carnegie  Mellon  Uni¬ 
versity,  gathered  by  Freitag  (1998).  The  messages  are 
annotated  with  the  seminar’s  starting  time,  ending  time, 
location,  and  speaker.  This  data  set  has  been  the  sub¬ 
ject  of  much  previous  work  using  a  wide  variety  of  learn¬ 
ing  methods.  Despite  all  this  work,  however,  the  best 


Figure  2:  Learning  curves  for  the  seminars  data  set  on 
the  speaker  field,  averaged  over  10-fold  cross  validation. 
Joint  training  performs  equivalently  to  cascaded  decoding 
with  25%  more  data. 

reported  systems  have  precision  and  recall  on  speaker 
names  of  only  about  70% — too  low  to  use  in  a  practical 
system.  This  task  is  so  challenging  because  the  messages 
are  written  by  many  different  people,  who  each  have  dif¬ 
ferent  ways  of  presenting  the  announcement  information. 

Because  the  task  includes  finding  locations  and  per¬ 
son  names,  the  output  of  a  named-entity  tagger  is  a  use¬ 
ful  feature.  It  is  not  a  perfectly  indicative  feature,  how¬ 
ever,  because  many  other  kinds  of  person  names  appear  in 
seminar  announcements — for  example,  names  of  faculty 
hosts,  departmental  secretaries,  and  sponsors  of  lecture 
series.  For  example,  the  token  Host:  indicates  strongly 
both  that  what  follows  is  a  person  name,  but  that  person 
is  not  the  seminars’  speaker. 

Even  so,  named-entity  predictions  do  improve  per¬ 
formance  on  this  task.  We  use  the  predictions  from  a 
CRF  named-entity  tagger  that  we  trained  on  the  standard 
CoNLL  2003  English  data  set.  The  CoNLL  2003  data 
set  consists  of  newswire  articles  from  Reuters  labeled  as 
either  people,  locations,  organizations,  or  miscellaneous 
entities.  It  is  much  larger  than  the  seminar  announce¬ 
ments  data  set.  While  the  named-entity  data  contains 
203,621  tokens  for  training,  the  seminar  announcements 
data  set  contains  only  slightly  over  60,000  training  to¬ 
kens. 

Previous  work  on  the  seminars  data  has  used  a  one- 
field-per-document  evaluation.  That  is,  for  each  field,  the 
CRF  selects  a  single  field  value  from  its  Viterbi  path,  and 
this  extraction  is  counted  as  correct  if  it  exactly  matches 
any  of  the  true  field  mentions  in  the  document.  We  com¬ 
pute  precision  and  recall  following  this  convention,  and 
report  their  harmonic  mean  F-\ .  As  in  the  previous  work. 


|  System  j 

stime 

etime 

location 

speaker 

overall 

WHISK 

(Soderland,  1999) 

92.6 

86.1 

66.6 

18.3 

65.9 

SRV 

(Freitag,  1998) 

98.5 

77.9 

72.7 

56.3 

76.4 

HMM 

(Frietag  &  McCallum.  1999) 

98.5 

62.1 

78.6 

76.6 

78.9 

RAPIER 

(Califf  &  Mooney,  1999) 

95.9 

94.6 

73.4 

53.1 

79.3 

SNOW-IE 

(Roth  &  Wen-tau  Yih,  2001) 

99.6 

96.3 

75.2 

73.8 

86.2 

(LP)2 

(Ciravegna,  2001) 

99.0 

95.5 

75.0 

77.6 

86.8 

CRF  (no  transfer) 

This  paper 

99.1 

97.3 

81.0 

73.7 

87.8 

CRF  (cascaded) 

This  paper 

99.2 

96.0 

84.3 

74.2 

88.4 

CRF  (joint) 

This  paper 

99.1 

96.0 

85.3 

76.3 

89.2 

Table  2:  Comparison  of  F\  performance  on  the  seminars  data.  Joint  decoding  performs  significantly  better  than 
cascaded  decoding.  The  overall  column  is  the  mean  of  the  other  four.  (This  table  was  adapted  from  Peshkin  and 
Pfeffer  (2003).) 


we  use  10-fold  cross  validation  with  a  50/50  training/test 
split.  We  use  a  spherical  Gaussian  prior  on  parameters 
with  variance  a1  =  0.5. 

We  evaluate  whether  joint  decoding  with  cascaded 
training  performs  better  than  cascaded  training  and  de¬ 
coding.  Table  2  compares  cascaded  and  joint  decoding 
for  CRFs  with  other  previous  results  from  the  literature.1 
The  features  we  use  are  listed  in  Table  1 .  Although  previ¬ 
ous  work  has  used  very  different  feature  sets,  we  include 
a  no-transfer  CRF  baseline  to  assess  the  impact  of  transfer 
from  the  CoNLL  data  set.  All  the  CRF  runs  used  exactly 
the  same  features. 

On  the  most  challenging  fields,  location  and  speaker, 
cascaded  transfer  is  more  accurate  than  no  transfer  at  all, 
and  joint  decoding  is  more  accurate  than  cascaded  decod¬ 
ing.  In  particular,  for  speaker,  we  see  an  error  reduction 
of  8%  by  using  joint  decoding  over  cascaded.  The  differ¬ 
ence  in  FI  between  cascaded  and  joint  decoding  is  statis¬ 
tically  significant  for  speaker  (paired  t- test;  p  =  0.017) 
but  only  marginally  significant  for  location  ( p  =  0.067). 
Our  results  are  competitive  with  previous  work;  for  ex¬ 
ample,  on  location,  the  CRF  is  more  accurate  than  any  of 
the  existing  systems. 

Examining  the  trained  models,  we  can  observe  both 
errors  made  by  the  general-purpose  named  entity  tagger, 
and  how  they  can  be  corrected  by  considering  the  sem¬ 
inars  labels.  In  newswire  text,  long  runs  of  capitalized 
words  are  rare,  often  indicating  the  name  of  an  entity.  In 
email  announcements,  runs  of  capitalized  words  are  com¬ 
mon  in  formatted  text  blocks  like: 

Location:  Baker  Hall 

Host:  Michael  Erdmann 

In  this  type  of  situation,  the  named  entity  tagger  often 
mistakes  Host:  for  the  name  of  an  entity,  especially  be¬ 
cause  the  word  preceding  Host  is  also  capitalized.  On  one 
of  the  cross-validated  testing  sets,  of  80  occurrences  of 

*We  omit  one  relevant  paper  (Peshkin  &  Pfeffer,  2003)  be¬ 
cause  its  evaluation  method  differs  from  all  the  other  previous 
work. 


Wt  =  w 

Wt  matches  [A-Z]  [a-z]  + 

Wt  matches  [A-Z]  [A-Z]+ 

Wt  matches  [A-Z] 
wt  matches  [A-Z]  + 

Wt  matches  [ A-Z ]  +  [a-z ]  +  [A-Z ]  +  [ a-z ] 

Wt  is  punctuation 

Wt  appears  in  list  of  first  names,  last  names,  honorifics,  etc. 


gfc(x,  t  +  6)  for  all  k  and  S  £  [—2,  2J 
Conjunction  qk(x,  t )  and  qy  (x,  l)  for  all  features  k,  k! 
Conjunction  qt(x,  t )  and  qy  (x,  f  +  1)  for  all  features  k,  k' 


Table  3:  Input  features  qk(x,  t)  for  the  ACE  named-entity 
data.  In  the  above  wt  is  the  word  at  position  and  w 
ranges  over  all  words  in  the  training  data. 


the  word  Host:,  the  named-entity  tagger  labels  52  as  some 
kind  of  entity.  When  joint  decoding  is  used,  however, 
only  20  occurrences  are  labeled  as  entities.  Recall  that 
the  joint  model  uses  exactly  the  same  weights  as  the  cas¬ 
caded  model;  the  only  difference  is  that  the  joint  model 
takes  into  account  information  about  the  seminar  labels 
when  choosing  named-entity  labels.  This  is  an  example 
of  how  domain-specific  information  from  the  main  task 
can  improve  performance  on  a  more  standard,  general- 
purpose  subtask. 

Figure  2  shows  the  difference  in  performance  between 
joint  and  cascaded  decoding  as  a  function  of  training  set 
size.  Cascaded  decoding  with  the  full  training  set  of  242 
emails  performs  equivalently  to  joint  decoding  on  only 
181  training  instances,  a  25%  reduction  in  the  training 
set. 

In  summary,  even  with  a  simple  cascaded  training 
method  on  a  well-studied  data  set,  joint  decoding  per¬ 
forms  better  for  transfer  than  cascaded  decoding. 

6.2  Entity  Recognition 

In  this  section  we  give  results  on  joint  decoding  for  trans¬ 
fer  between  two  newswire  data  sets  with  similar  but  over¬ 
lapping  label  sets.  The  Automatic  Content  Extraction 
(ACE)  data  set  is  another  standard  entity  recognition  data 


Person  name 
Person  nominal 
Organization  name 
Organization  nominal 
GPE  name 
GPE  nominal 

Table  4:  Comparison  of  F\  performance  between  joint 
and  cascaded  training  on  the  ACE  entity  recognition  task. 
GPE  means  geopolitical  entities,  such  as  countries.  Joint 
decoding  helps  most  on  the  harder  nominal  (common 
noun)  references.  These  results  were  obtained  using  a 
small  subset  of  the  training  set. 

set,  containing  422  stories  from  newspaper,  newswire, 
and  broadcast  news.  Unlike  the  CoNLL  entity  recog¬ 
nition  data  set,  in  which  only  proper  names  of  entities 
are  annotated,  the  ACE  data  includes  annotation  both  for 
named  entities  like  United  States ,  and  also  nominal  men¬ 
tions  of  entities  like  the  nation.  Thus,  although  the  input 
text  has  similar  distribution  in  the  CoNLL  NER  and  ACE 
data  set,  the  label  distributions  are  very  different. 

Current  state-of-the-art  systems  for  the  ACE  task  (Flo- 
rian  et  al.,  2004)  use  the  predictions  of  other  named-entity 
recognizers  as  features,  that  is,  they  use  cascaded  trans¬ 
fer.  In  this  experiment,  we  test  whether  the  transfer  be¬ 
tween  these  datasets  can  be  further  improved  using  joint 
decoding.  We  train  a  CRF  entity  recognizer  on  the  ACE 
dataset,  with  the  output  of  a  named-entity  entity  recog¬ 
nizer  trained  on  the  CoNLL  2003  English  data  set.  The 
CoNLL  recognizer  is  the  same  CRF  as  was  used  in  the 
previous  experiment.  In  these  results,  we  use  a  subset  of 
10%  of  the  ACE  training  data.  Table  3  lists  the  features 
we  use.  Table  4  compares  the  results  on  some  represen¬ 
tative  entity  types.  Again,  cascaded  decoding  for  transfer 
is  better  than  no  transfer  at  al,  and  joint  decoding  is  better 
than  cascaded  decoding.  Interestingly,  joint  decoding  has 
most  impact  on  the  harder  nominal  references,  showing 
marked  improvement  over  the  cascaded  approach. 

7  Related  Work 

Researchers  have  begun  to  accumulate  experimental  ev¬ 
idence  that  joint  training  and  decoding  yields  better  per¬ 
formance  than  the  cascaded  approach.  As  mentioned  ear¬ 
lier,  the  original  work  on  dynamic  CRFs  (Sutton  et  al., 
2004)  demonstrated  improvement  due  to  joint  training  in 
the  domains  of  part-of-speech  tagging  and  noun-phrase 


Transfer  Type 


none 

cascaded 

joint 

81.0 

86.9 

87.3 

34.9 

36.1 

42.4 

53.9 

62.6 

61.1 

33.7 

35.3 

40.8 

78.5 

84.0 

84.0 

51.2 

54.1 

59.2 

chunking.  Also,  Carreras  and  Marquez  (Carreras  & 
Marquez,  2004)  have  obtained  increased  performance  in 
clause  finding  by  training  a  cascade  of  perceptrons  to 
minimize  a  single  global  error  function.  Finally,  Miller  et 
al.  (Miller  et  al.,  2000)  have  combined  entity  recognition, 
parsing,  and  relation  extraction  into  a  jointly-trained  sin¬ 
gle  statistical  parsing  model  that  achieves  improved  per¬ 
formance  on  all  the  subtasks. 

Part  of  the  contribution  of  the  current  work  is  to  sug¬ 
gest  that  joint  decoding  can  be  effective  even  when  joint 
training  is  not  possible  because  jointly-labeled  data  is  un¬ 
available.  For  example.  Miller  et  al.  report  that  they  orig¬ 
inally  attempted  to  annotate  newswire  articles  for  all  of 
parsing,  relations,  and  named  entities,  but  they  stopped 
because  the  annotation  was  simply  too  expensive.  In¬ 
stead  they  hand-labeled  relations  only,  assigning  parse 
trees  to  the  training  set  using  a  standard  statistical  parser, 
which  is  potentially  less  flexible  than  the  cascaded  train¬ 
ing,  because  the  model  for  main  task  is  trained  explicitly 
to  match  the  noisy  subtask  predictions,  rather  than  being 
free  to  correct  them. 

In  the  speech  community,  it  is  common  to  com¬ 
pose  separately  trained  weighted  finite-state  transducers 
(Mohri  et  al.,  2002)  for  joint  decoding.  Our  method  ex¬ 
tends  this  work  to  conditional  models.  Ordinarily,  higher- 
level  transducers  depend  only  on  the  output  of  the  previ¬ 
ous  transducer:  a  transducer  for  the  lexicon,  for  exam¬ 
ple,  consumes  only  phonemes,  not  the  original  speech 
signal.  In  text,  however,  such  an  approach  is  not  sensi¬ 
ble,  because  there  is  simply  not  enough  information  in 
the  named-entity  labels,  for  example,  to  do  extraction  if 
the  original  words  are  discarded.  In  a  conditional  model, 
weights  in  higher-level  transducers  are  free  to  depend  on 
arbitrary  features  of  the  original  input  without  any  addi¬ 
tional  complexity  in  the  finite-state  structure. 

Finally,  stacked  sequential  learning  (Cohen  &  Car¬ 
valho,  2005)  is  another  potential  method  for  combining 
the  results  of  the  sub  task  transducers.  In  this  general 
meta-learning  method  for  sequential  classification,  first 
a  base  classifier  predicts  the  label  at  each  time  step,  and 
then  a  higher-level  classifier  makes  the  final  prediction, 
including  as  features  a  window  of  predictions  from  the 
base  classifier.  For  transfer  learning,  this  would  corre¬ 
spond  to  having  an  independent  base  model  for  each  sub¬ 
task  (e.g.,  independent  CRFs  for  named-entity  and  sem¬ 
inars),  and  then  having  a  higher-level  CRF  that  includes 
as  a  feature  the  predictions  from  the  base  models. 

8  Conclusion 

In  this  paper  we  have  shown  that  joint  decoding  improves 
transfer  between  interdependent  NLP  tasks,  even  when 
the  old  task  is  named-entity  recognition,  for  which  highly 
accurate  systems  exist.  The  rich  features  afforded  by  a 
conditional  model  allow  the  new  task  to  influence  the  pre- 


dictions  of  the  old  task,  an  effect  that  is  only  possible  with 
joint  decoding. 

It  is  now  common  for  researchers  to  publicly  release 
trained  models  for  standard  tasks  such  as  part-of-speech 
tagging,  named-entity  recognition,  and  parsing.  This  pa¬ 
per  has  implications  for  how  such  standard  tools  are  pack¬ 
aged.  Our  results  suggest  that  off-the-shelf  NLP  tools 
will  need  not  only  to  provide  a  single-best  prediction,  but 
also  to  be  engineered  so  that  they  can  easily  communicate 
distributions  over  predictions  to  models  for  higher-level 
tasks. 
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